Multi-breed genomic predictions and functional variants for fertility of tropical bulls

Worldwide, most beef breeding herds are naturally mated. As such, the ability to identify and select fertile bulls is critically important for both productivity and genetic improvement. Here, we collected ten fertility-related phenotypes for 6,063 bulls from six tropically adapted breeds. Phenotypes were comprised of four bull conformation traits and six traits directly related to the quality of the bull’s semen. We also generated high-density DNA genotypes for all the animals. In total, 680,758 single nucleotide polymorphism (SNP) genotypes were analyzed. The genomic correlation of the same trait observed in different breeds was positive for scrotal circumference and sheath score on most breed comparisons, but close to zero for the percentage of normal sperm, suggesting a divergent genetic background for this trait. We confirmed the importance of a breed being present in the reference population to the generation of accurate genomic estimated breeding values (GEBV) in an across-breed validation scenario. Average GEBV accuracies varied from 0.19 to 0.44 when the breed was not included in the reference population. The range improved to 0.28 to 0.59 when the breed was in the reference population. Variants associated with the gene HDAC4, six genes from the spermatogenesis-associated (SPATA) family of proteins, and 29 transcription factors were identified as candidate genes. Collectively these results enable very early in-life selection for bull fertility traits, supporting genetic improvement strategies currently taking place within tropical beef production systems. This study also improves our understanding of the molecular basis of male fertility in mammals.


Introduction
Under a sustainable intensification framework for livestock production, factors that affect the health of humans and animals, and the environment are considered [1]. Bull fertility has such a major impact on the entire beef production system that it is a key phenotype for improvement and an opportunity to assist in meeting the world's increasing demand for animal protein [2]. Bull fertility is an important economic driver of profitability for beef cattle operations in tropical regions such as northern Australia. Different cattle are better adapted to different environments, resulting in within and between breed variations in fertility traits in response to adverse environmental conditions (e.g. hot humid conditions) [3,4]. The identification of genetic variants that affect fertility traits, and their application in animal selection would assist in better matching the animal type, production system, and the breeding environment, which can be considered a sustainability goal. An understanding of male physiology is available [5][6][7], some relevant genes that affect fertility traits have been identified [8,9] and genetic parameters have been estimated [10,11], mainly using pedigree information. However, the implementation of selection approaches for animal improvement, including genomic selection-based programs, have been limited. Compared to beef cow fertility, which has received much consideration and has been managed through the development of reproductive biotechnologies [12,13] and early stages implementation of genomic-based selection [14,15], bull fertility lags behind. Currently, in beef cattle, scrotal circumference (SC) is the only bull fertility-related phenotype included in most genetic/genomic evaluation programs. However, during the conduct of a typical bull breeding soundness examination (BBSE) [16] a range of physical, and semen quality traits are objectively assessed. Given the latter has been shown to be heritable [11,[17][18][19], these traits should be included in male fertility selection programs. This is particularly relevant in tropical environments where there is a risk of high body heat load adversely affecting semen quality.
The multiplicity of breeds used in tropical beef production systems limits the prospect to assemble the large reference populations needed for accurate genomic estimated breeding values (GEBV) for bull fertility. To address this limitation, strategies combining reference populations across breeds using imputed high-density SNP (single nucleotide polymorphisms) genotypes to produce accurate multi-breed GEBV are currently being investigated [15,20,21]. Here, we merged six breeds widely used in northern Australia and (sub)tropical regions worldwide to assemble a population of 6,063 bulls measured for ten fertility traits, including four observations on the bull and six observations on the bull's semen. Using imputed genotypes at high-density (680,758 SNPs), we investigate the accuracy and bias of GEBV using two validation strategies: one where all the phenotypes from one breed are missing from the reference; and the other representing a 5-way cross-validation scheme where a random 20% of phenotypes from all populations are missing from the reference.
This rich dataset was also used to dissect the genomics of bull fertility. Using a 3-way combination of gene network connectivity, pleiotropy, and gene expression in sperm, we identified several candidate functional variants associated with bull fertility traits. Future work could test if these variants aid across breed predictions for bull fertility traits.

Analysis overview
Our approach was to assemble a large reference population of tropically adapted bulls with fertility-related phenotypes and genotyped at high-density to be used to generate multi-breed GEBVs with useful accuracies. A further aim was to explore the factors affecting the accuracy and bias of GEBVs and the identification of functional variants. Our analysis had 4 major steps (detailed description in Materials & Methods): 1. In the Reference step (Fig 1A), a multi-breed population was assembled with 6,063 bulls from six tropically adapted breeds Brahman (BRM), Tropical Composite (TRC), Santa Gertrudis (SGT), Droughtmaster (DMT), Ultra Black (UBK) and Belmont Tropical Composite (BTC) with measures on ten fertility-related phenotypes including measures on the bull Weight (WT, Kg) and body condition score (COND, score) at the same day (or close to) the observation of the other traits, scrotal circumference (SC, cm), sheath score (SHEATH, score) and six measures on the bull's semen. The semen traits were density (DENS, score), mass activity (MASS, score), percentage of progressive motility (MOT, %), percentage of normal sperm (PNS, %), percentage of sperm cells with proximal cytoplasmic droplets (PD, %) and percentage of sperm cells with middle-piece abnormalities (MD, %). These traits were observed once, primarily collected as routine evaluation and preparation for the bull sale.
2. In the Inference step ( Fig 1B), a total of 351 analytical models based on GREML approaches were undertaken to estimate variance components, including heritabilities (h 2 ) and genetic correlations, as well as to obtain GEBV. The GREML analyses were performed using a multivariate model using the whole dataset, allowing the estimation of GEBV Whole (orû w in numerical notation, see Methods) and in uni-variate models for cross-validation after

Genetic parameters
The number of records per breed varied from 660 (BTC) to 1,819 (TRC) (S1 Table in S1 File). Descriptive statistics of each trait are presented in S2 Table in S1 File. The traits data were adjusted, before the genetics analyses, for non-genetic effects of population, year of birth, cohort, and the first two principal components (more information on Material and Methods).
Multi-breed estimates of genomic heritabilities, genetic and residual correlations are given in Table 1. Using the multi-breed genomic relationship matrix (GRM) (S1 Fig in S1 File) across the 6,063 bulls, 45 bi-variate analyses were performed for as many pair-wise trait combinations using adjusted phenotypes. Each trait was included in nine analyses (one with each of the remaining traits) and the heritabilities listed in Table 1 correspond to the average heritability estimated across the nine analyses. These reported heritability estimates are well within those published in the literature for the same traits and, on occasions using a subset of this population. For instance, using a population of Brahman (BRM) and Tropical Composite (TRC) bulls, Porto-Neto et al. [23] reported a heritability estimate for sheath score of 0.51 and 0.57 for BRM and TRC, respectively, and very similar to the 0.57 reported here. Similarly, and more recently, for the semen traits, Fortes et al. [19] reported a heritability estimate for PNS (percent normal sperm) of 0.35 and 0.29 for BRM and TRC bulls, slightly higher than the 0.24 found here. This study expanded previous analyses by including cattle breeds not evaluated before, further validating the genetic relationship between semen traits. The positive correlations (both genetic and residual) between weight, condition score and scrotal circumference have been reported in the past [17], as it was the positive genetic correlations observed for density, mass, motility, and percent normal sperm and the genetically negative correlation between the semen defect traits of PD (proximal cytoplasmic droplets) and MP (midpiece abnormalities) [19]. Importantly, PD and MP were almost uncorrelated (genetic and residual correlation of 0.06 and 0.01, respectively), indicating that these two sperm defects are independent of each other. As such, these genetic parameter estimates offer hope for the possibility of implementing genetic selection programs for these bull fertility traits.
Using bivariate models, we estimated the genomic correlation of the same trait observed in different breeds (Table 2). Except for scrotal circumference (SC) and sheath score, for which the estimated genomic correlations were moderate and positive (i.e., averaged across the 15 breed pairs of 0.36 and 0.51 for SC and sheath, respectively), all other estimates were near zero suggesting a limited contribution of genomic information across breeds. The strong genomic correlations for SC and sheath score might hint at the presence of genomic regions segregating in the populations with a large effect for each trait. Averaged across the 50 bi-variate analyses in which each breed was involved (10 traits × 5 comparing breeds), the estimated genomic correlation was 0.10, 0.13, 0.13, 0.08, 0.09 and 0.09 for BRM, TRC, SGT, DMT, UBK and BTC, respectively. The highest average genomic correlation observed for TRC and SGT suggest GEBVs from these breeds would be more robust in a multi-breed scenario.

Genomic predictions
We did not observe breed differences for the average GEBV within a trait as they were all nonsignificantly different from zero. However, there were some differences in the GEBV variation across breeds and these differences could reflect different accuracies with higher variations associated with higher accuracies.
Two schemes were used to estimate GEBV accuracies (Table 3): #1) removed a single population completely from the reference and tested the accuracy in predicting its breeding values, and #2) 20% of the observed data were set to missing in five-way cross-validation. GEBV accuracy estimates for a breed, when the breed is not represented in the reference, were lower than those when some animals of the breed are included in the reference (comparison between scheme 1 vs 2), with the largest impact on BRM. This observation was expected, given the known relationship between accuracy and genetic distance to the reference population for a given test animal [24]. Moreover, BRM is the most divergent breed among the six populations (Fig 2), even though it was used during the formation of some of the other breeds.
To further understand the factors affecting the properties of the multi-breed GEBV reported here, we employed an ANOVA model that contained the effects of heritability estimate, breed, and phenotypic trait as well as the cross-validation sample for the case of the second validation scheme. The results are presented in S3 Table in S1 File. The predictive ability of the model as measured by the coefficient of determination (R 2 ) was highest for the accuracy estimated using the method LR (ACC LR ) than for traditional accuracy (ACC R ), being 85.1% and 71.9% for validation schemes #1 and #2, respectively. The bias was not affected by any effect included in the ANOVA model indicating the random nature of its variation. Trait and breed were significant sources of variation for both measures of accuracy (S3 Table in S1 File) and dispersion. As in recent attempts at generating multi-breed GEBV [15], genomic linkages can be exploited through the multi-breed GRM. However, the GEBVs produced here are on a different and arbitrary base for each breed. To generate GEBVs on the same base for all breeds, contemporary groups consisting of animals of multiple breeds are required. A promising alternative would be to employ the metafounders approach [25,26].

Functional variants
One of the challenges for the implementation of genomic selection strategies in livestock, especially in multi-breed scenarios, is the ability of accurately estimate breeding values across populations and breeds. By focusing more on genomic markers in strong LD with causative variants in combination with denser marker sets or functional subsets of markers, it is possible to leverage information across distantly related breeds to increase the accuracy of genomic predictions [15,20,21,27]. The Association Weight Matrix (AWM) approach exploits patterns of SNP co-association across multiple traits to form connections between the genes harbouring those SNP. The resultant network can be subject to topological analysis that is thought to provide biological insights over and above what can be inferred using traditional approaches such as GWAS. The AWM captured 3,479 SNP-Genes scattered genome-wide, of which 161 genes were annotated as transcription factors (TF). When the AWM was subjected to the PCIT network inference algorithm, 914,955 significant connections between genes were identified. Gene HDAC4 (causal candidate chr3:118,220,595, around 4.8Kbp from the coding gene) was found to have the highest degree (i.e., the higher number of significant connections) with 1,200 connections, making it a key causal candidate. The class II histone deacetylase HDAC4 regulates endocrine functions including male fertility and appetite [28,29].
With undisputed importance in male fertility, the spermatogenesis-associated (SPATA) family consists of 25 genes (out of 27,607 annotated genes in the bovine genome). Six SPATA genes were captured by the AWM, representing a significant enrichment (P = 0.03,  . Of these, we highlight SPA-TA6L with 1,139 network connections and its strongest association with the trait PNS, and SPATA16 with 1,058 connections and strongest association with the trait MASS. In fact, a missense variant (p.Ile193Met) in SPATA16 has been recently associated with male fertility of dairy cattle [9]. In addition, the same work identified two coding variants highly associated with bull fertility in ENSBTAG00000019919, another gene captured by our AWM approach that is not fully annotated yet.
The 3-way search for key genes using network connectivity, pleiotropy score and spermspecific expression, allowed us to prioritise 29 transcription factors (TF) (Table 4, Fig 1) that were ranked in the top 10 positions in at least one of the three criteria. Six of these 29 genes were mapped to chromosome 5 including ARID2, DBX2, YEATS4, HMGA2, SMARCC2, SOX5 and NANOG5. The bovine chromosome 5 has been the subject of great scrutiny. For instance, a literature search on PubMed.gov using the string "bovine chromosome 5" results in 138 publications. Of these, it is worth remarking on the USDA work by McDaneld et al., [30]

PLOS ONE
who identified a deletion on chromosome 5 associated with reproductive efficiency in Bos indicus-influenced cattle. However, the association was at around~10Mbp away from the strongest SNP association here. It is unclear if it relates to the same or a closely located quantitative trait locus. Importantly, and specific for sheath score, Aguiar et al., [31] reported an association of copy number variation at intron 3 of HMGA2 with navel length in Bos indicus cattle. In the present study, the SNP in the coding region of HMGA2 (causal candidate: chr5:47,810,529) was found to have the strongest pleiotropic effect of all the SNP-genes in the AWM (chisquare = 327.99; P-value < 10 −16 ). The expression of HMGA2 distinguishes between different types of post-pubertal testicular germ cell tumours [32], thus, indicating that HMGA2 has a role in germ cell proliferation, which could affect testicular development in bulls. Moreover, from the 204 genes captured by the AWM for having SNPs associated with sheath score, 104 were located in chromosome 5. Within the network (S2 Fig in S1 File), most genes associated with sheath score create a well-defined cluster, reflecting their cooperative molecular role. Among those, we can highlight HELB (chr5:47,487,358), a gene identified in the context of selection signatures in tropical cattle and proven to segregate independently of the copy number variation HMGA2-CNV [33]. Earlier work also pointed at HELB as a candidate gene for regulating inhibin, produced by Sertoli cells, and shown to be an early biomarker for sexual development in tropical beef cattle [34]. More recently, Xiang et al. [35] reported a cluster of pleiotropic and functional variants on chromosome 5 associated with dairy cattle traits, and Alexandre et al. [36] described a potential indicine-specific peak of chromatin accessibility that agrees with a selective sweep at this genomic region. The second cluster of SNP-Genes captured by the AWM was located on chromosome X with 283 genes or 8.13% of the 3,479 in the AWM, compared to 24,058 in the entire panel of 680,758 SNP (or 3.53%, P-value < 0.01). Noting that most genes not shared between species are highly expressed in testis [37], suggested the X chromosome as a promising source of genes associated with spermatogenesis and male fertility. In our AWM, most of the X-mapped genes were associated with PNS, and the ones with the greatest number of connections, highest pleiotropic effect and most expressed in sperm were PLP1 (PNS, chrX:52,547,466), PAK3 (PNS, chrX:59,261,790) and ENSBTAG00000030490 (60S ribosomal protein L17, SHEATH, chrX:102,722,443), respectively. PLP1 was differentially expressed between fresh and frozenthawed sperm of Holstein bulls [38], with cryodamage being a major problem in semen cryopreservation, causing changes to sperm transcripts that may influence sperm function and motility. Thus, it is tempting to speculate the use of PLP1 as a biomarker of sperm quality or cryotolerance.
To further study the 29 transcription factor genes identified by our 3-way search (Fig 3A), we extracted AWM output (co-association scores) for those genes and hierarchically clustered them. Two major trait clusters were identified (Fig 3B): the first one mostly contained the traits measured in the bulls (WT, COND, SC, and SHEATH) plus PD and MP and the second included the other sperm traits (DENS, MASS, MOT and PNS). This separation between sperm traits is probably because for PD and MP the smaller the value the better, while the opposite is true for DENS, MASS, MOT and PNS. In terms of the genes, also two clusters can be observed potentially indicating different biological functions.
Finally, when we ran a functional enrichment analysis using the 29 key TF (S4 Table in S1 File), apart from finding expected enriched terms related to DNA binding, transcription, and chromatin remodelling, we also found "stem cell differentiation" (FDR = 4.51E-03) and "regulation of stem cell differentiation (FDR = 2.39E-02). The enrichment was due to five genes, namely NANOG (chr5: 101,425,382), HMGA2 (chr5: 47,810,529), RBPJ (chr6: 45,854,269), NKX2-1 (chr21: 46,713,013) and SOX5 (chr5: 86,055,295). Except for SOX5, they present similar expression profiles, falling in the upper cluster of Fig 3B. Importantly, those five genes were selected by different criteria: while NANOG and SOX5 were selected due to high expression in sperm cells, RBPJ and NKX2-1 were selected based on high connectivity, and HMGA2 was based on pleiotropy. In this sense, Fig 3C illustrates how the three selection criteria put forward different genes. Using this 3-way schema we were able to identify biologically relevant processes that could have been missed if using only one method had been used.
Stem cell differentiation is the process by which a cell is changed to a more specialized cell type. In the context of male fertility, spermatogonial stem cells go through a series of differentiation steps to become spermatozoa [39]. NANOG, a homeobox protein mostly associated with pluripotent cells, is also a key player in sperm cell differentiation of different mammalians [40]. It is not surprising then that it appears in the top 10 highly expressed genes in bovine sperm cells within our dataset. One positively regulated direct target of NANOG, at least in embryonic stem cells [41], is also within our key 29 TF. The Estrogen Related Receptor Beta (ESRRB) was shown to replace the NANOG function in pluripotent cells [41,42] and our dataset demonstrates its regulatory potential by being the fifth most connected TF in our network. HMGA2 is part of the family of high mobility group proteins and is characteristic of rapidly dividing cells such as embryonic tissue and tumours [43]. It has also been shown to play an important role in male fertility, as the impairment of its function in mice was shown to block spermatogenesis [44]. Another gene controlling sperm cell proliferation and differentiation is RBPJ (Recombination Signal Binding Protein For Immunoglobulin Kappa J Region) which seems to be required for the proper regulation of the testis stem cell microenvironment [45]. SOX5 is one of the SRYrelated box TF involved in steroidogenesis, spermatogenesis and sperm hyperactivation [46,47]. While is not our intent to discuss all genes associated with relevant SNPs in this work, the relevance of several of these genes in the context of male fertility is evident.
The observed genomic correlations between the same trait in different breeds imply similarities in the genetic background across breeds. However, this was not observed across most traits. We confirmed that male fertility-related traits GEBV can be estimated with useful accuracy in a multi-breed scenario and that higher accuracies are obtained when the target breed is included in the reference population. Finally, functional variants/genes indicated through a 3-ways prioritization strategy are proposed as markers to assist in selection for male fertility traits. This multi-breed approach could be further developed in the future, aiming at a broader adoption of the technology by the industry.

Animal care and use committee assessment
The total cattle population combined two research-based and four stud herds. The JM Rendel Laboratory Animal Experimentation Ethics Committee (CSIRO, Queensland, Australia) evaluated and approved the handling and sampling protocols of the two research-based herds (TBC107 and RH225-06, 1999RH225-06, -2006RH225-06, and 2006RH225-06, -2010. For the four stud herds, historical DNA samples were accessed directly from a parent verification laboratory after receiving the consent of the bull owner. These samples were collected by bull breeders and sent to their service provider laboratory to confirm the parentage of their calves. No sample was collected, or cattle were handled for this project. In consultation with our local Animal Care and Use Committee, accessing archived historical samples was considered to fall outside the scope of the committee's approval process.

Animals and phenotypes
A total of 6,063 bulls from six breeds with measures for 10 phenotype traits were sourced from two research populations and four stud herds (S1 Table in S1 File). The two research populations comprised 1,051 Brahman (BRM) and 1,819 Tropical Composite (TRC) bulls and were part of a larger population established by the Cooperative Research Centre for Beef Genetic Technologies (Beef CRC) to understand the genetic links between adaptability and components of herd profitability in northern Australia [23,48]. The four stud herds provided historical phenotypes and access to DNA (or hair) samples stored at a parent verification laboratory. The breeds represented by these four herds were Santa Gertrudis (SGT; N = 929 bulls), Droughtmaster (DMT; N = 760), Ultra Black (UBK; N = 844) and Belmont Tropical Composite (BTC; N = 660).
The 10 phenotypes included four measured on the bull and six measured on the bull's semen (S2 Table in S1 File). Phenotypes measured on the bull included: body weight (WT, kg), body condition score (COND, score 1 to 5), scrotal circumference (SC, cm) measured with a standard measuring tape and sheath score (SHEATH, score 1 to 5). Semen samples were collected by an experienced technician using electroejaculation. The phenotypes measured on the bull's semen included: the crush-side measurements of density (DENS, visual score 1 to 5), mass activity (MASS, visual score 1 to 5), progressive sperm motility (MOT, %), and the labbased measurements of the percent normal sperm (PNS, %), percent sperm with proximal droplets (PD, %) and percent midpiece abnormalities (MP, %). A single observation was taken for each trait on each bull. For the research populations (BRM and TRC), SC was measured at yearling, while the other parameters, were measured at the first time the bull produced a semen sample. For all other populations, the traits were recorded as part of their routine preparation of the animals for sale.
The age at which the phenotypes were observed varied across the breeds; for BRM and TRC cattle, the age at SC was around 360 d, and for Sheath and PNS around 700 d. For SGT and DMT all phenotypes were observed at around 600 d of age, while for UBK and BTC at around 440 d and 390 d of age, respectively.

Genotypes and genomic relationships
Most animals were genotyped using a commercial SNP chip with around~50,000 markers. Genotypes were imputed up to high density (~700K SNP) using a reference population that combined Beef CRC and industry cattle genotyped on the higher density platform, mapped onto the ARS_UCD 1.2 cattle genome assembly [49]. Genotypes were first phased using Eagle [50] and then imputed using Minimac3 (autosomes) or Minimac4 (BTAX) [51]. SNP with imputation r 2 > 0.8 were kept for further analyses. In total, 680,758 SNP were kept including 24,058 mapped on the X chromosome. We ascertained genomic structures based on principal components analyses (PCA) using PLINK1.9 [52] and computed genomic relationships employing within-and across-breed genomic relationship matrices (GRM) using Method 1 of VanRaden [53].

Estimation of genetic parameters and genomic predictions
The phenotypes were adjusted before the genomic analyses for non-genetic effects using SAS software, Version 9.4 for PC. Copyright © [2002-2012] SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA. The model for adjustment was as follows: y = Xβ + ε 1 where y is the vector of values from a given phenotype, X is an incidence matrix relating the phenotypes with the fixed effects in β, which included the effects of population, year of birth and cohort, and the covariates of age and the first two principal components, and ε 1 is the vector of random error assumed to be independent and normally distributed. The adjusted phenotypes (y � ) were analyzed using a fully random model with polygenic (u) and residual effects to obtain estimates of genetic parameters and predicted genomic-estimated breeding values (GEBV) using the multi-variate genomic-relatedness-based restricted maximum likelihood (GREML) approach as implemented in Qxpak5 [54]. To estimate the genetic correlation between pairs of traits for the combined dataset with six breeds, a total of 45 bi-variate GREML analyses were performed for all pair-wise phenotypes (i.e., each using a GRM of dimension 6,063). In addition, following analytical approaches described in [21], the genomic correlation for a given phenotype in two breeds was estimated by treating each phenotype as a different trait in each breed pair. As a result, we performed a total of 150 bi-variates GREML analyses (i.e., from 15 pairs across the 6 breeds times the 10 phenotypes) each with a GRM of dimension equal to the number of animals in the breed pair.

Quality assessment of genomic predictions
Traditional (or correlation-based) [55] and Method LR [22] approaches were used to estimate accuracy of GEBV. Using the second approach, bias and dispersion of GEBV were also estimated. Genomic estimates calculated using the whole dataset were defined as GEBV Whole (or u w ) and uni-variate models for cross-validation GEBV Partial (orû p ).
The following four metrics were employed: 1. Correlation-based Accuracy (ACC R ): In the context of cross-validation, the accuracy of a GEBV is traditionally computed from the Pearson correlation between a GEBV and the adjusted phenotype (y � ; phenotype y adjusted for fixed effects) for individuals in the validation population, and divided by the square root of heritability (h 2 ): Where F is the average inbreeding coefficient, 2f is the average relationship between individuals, and s 2 g;1 is the genetic variance at equilibrium in a population under selection. Assuming the individuals in the validation population are not under selection, s 2 g;1 can be estimated by the additive genetic variance estimated from the partial dataset.
3. Method LR Bias (Bias LR ): Difference between the average GEBV of individuals in the validation population using the partial data minus that using the whole data: In the absence of bias, the expected value of Bias LR is zero.
4. Method LR Dispersion (Disp LR ): For individuals in the validation population, dispersion was measured from the slope of the regression ofû w onû p : In the absence of bias, the expected value of Disp LR is 0. Values less than 0 indicate underdispersion (or deflation) ofû p intoû w as phenotypes become available. Values greater than 1 indicate over-dispersion (or inflation) ofû p intoû w .

SNP effects, Association Weight Matrix (AWM), pleiotropy and gene coassociation networks
We used the AWM methodology [56,57] to identify and select the SNP to define the input data to construct a co-association network inferred using the Partial Correlation and Information Theory (PCIT) algorithm [58]. In brief, the AWM is built by generating normalized SNP effects across all ten traits, associating SNP to genes based on distance within the genome and selecting SNP/genes based on their significance to the key trait (PNS) or pleiotropic effect. First, for each trait, SNP effects were estimated using derivations from [59,60] as follows: In whichŝ is the vector of estimated SNP effects of dimension 680,758 for as many SNP included in the analyses, λ is the ratio of SNP variance to genetic variance and assumed 0.85 throughout, M is the matrix of genotypes centred for observed allele frequencies with dimension equal to the number of animals (6,063) by number of SNP (680,758), G is the GRM computing using Method 1 of [53] across all animals and SNP, andû w is as defined earlier for the trait of interest. SNP effects were z-score standardized and p-values for the association of an SNP to the trait obtained from an inversed normal distribution (S1-S3 Data). Second, SNP were mapped to genes. We used linkage disequilibrium (LD) findings from [61] where at short distances between markers (< 10 kb), taurine breeds showed higher LD (r 2 = 0.45) than their indicine (r 2 = 0.25) and composite (r 2 = 0.32) counterparts. Therefore, we selected SNP located in the coding region or within 10 kb of an annotated gene based on the ARS_UCD1.2 bovine genome assembly [49]. Third, SNP (renamed by their linked gene) were included in the AWM if found to be either associated (p-value < 0.01) with the key phenotype (i.e. PNS) or pleiotropic. To capture pleiotropic SNP, we computed the average number of phenotypes to which the PNS-associated SNP were associated, namely N PNS , and selected the remaining SNP-Genes associated (p-value < 0.01) to more than N PNS phenotypes. To further characterize the pleiotropic potential of SNP we computed the multi-trait χ 2 statistic for the i-th SNP following derivations from [62]: i Whereŝ i is a 10 (number of traits) × 1 vector of z-score standardized effect of the i-th SNP and V −1 is the inverse of a 10 × 10 correlation matrix calculated overall estimated SNP effects. The χ 2 value of each SNP was assessed for significance based on a χ 2 distribution with 10 degrees of freedom to test against the null hypothesis that the SNP had no significant effect on any of the 10 traits. The final AWM contained as many rows as SNP (assigned to genes) and as many columns as traits, in this case ten traits.
To build the co-association network, we used the AWM as input, and identified significant gene-gene interactions using the PCIT algorithm [58] which calculates pairwise correlations between loci while accounting for the influence of a third locus. Unlike likelihood-based approaches, which invoke a parametric distribution (e.g., normal) assumed to hold under the null hypothesis and then a nominal p-value (e.g., 5%) used to ascertain significance, PCIT is an information theoretic approach. Its threshold is an informative metric; in this case, the partial correlation after exploring all trios in judging the significance of a given correlation, which might then become a connection when inferring a network. It thereby tests all possible 3-way combinations in a dataset and only keeps correlations between loci if they are significant and independent of another locus, whereas no hard threshold is set for the correlation strength. The significance threshold for each combination of loci depends on the average ratio of partial to direct correlations. Gene interactions were predicted using correlation analysis of the SNP effects across pairwise rows of the AWM. Hence, the AWM-predicted gene interactions are based on significant co-association between SNP. In the network, every node represents a gene (or SNP), whereas every edge connecting two nodes represents a significant gene-gene interaction (based on SNP-SNP co-association). Finally, the Cytoscape software [63] was used to visualize the gene network and the CentiScaPe plugin [64] was used to calculate specific node centrality values and network topology parameters. Functional enrichment analysis was performed using GeneMANIA [65].

Identification of potential functional variants
To flag functional variants a 3-way search for key genes was performed by ranking transcription factors based on network connectivity, pleiotropy score and sperm-specific expression. To do that, we accessed the Animal Transcription Factor Database (AnimalTFDB3.0; [66], on March 24, 2021) to download a set of 1,396 Bos taurus genes annotated as transcription factors and the Cattle Gene Atlas [67] to explore the sperm specificity of gene expression based on the sperm epigenomic study of Liu et al., [68]. Variants associated to genes ranked in the top 10 positions in at least one of the three criteria were considered potential functional variants.